SwiGLU MLP: parameter-neutral gated activation over LeakyReLU^2#676

Open
they-call-me-god wants to merge 1 commit into openai:main from they-call-me-god:swiglu-submission
Conversation

@they-call-me-god

Summary

Replace LeakyReLU(0.5)² with SwiGLU gating, the same multiplicative activation used in LLaMA, Mistral, Gemma, and PaLM.

Built on the PR #549 SOTA stack (LeakyReLU² + Legal TTT + Parallel Muon). A single change with zero parameter increase.

The Change

# Before (SOTA)
x = F.leaky_relu(F.linear(x, up_w), negative_slope=0.5).square()
out = F.linear(x, down_w)

# After (SwiGLU)
half = up_w.shape[0] // 2
gate = F.silu(F.linear(x, up_w[:half]))   # learned gating
up   = F.linear(x, up_w[half:])
out  = F.linear(gate * up, down_w)
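To make the shape bookkeeping concrete, here is a small numpy stand-in for the snippet above (random weights, `silu` written out by hand; the `F.linear` calls become plain matrix-vector products, and the dimensions 512/2048/1024 come from the table below):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_up = 512, 2048                  # fused gate||up projection width
up_w = rng.standard_normal((d_up, d_model))
down_w = rng.standard_normal((d_model, d_up // 2))

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x):
    half = up_w.shape[0] // 2
    gate = silu(up_w[:half] @ x)           # learned gate, shape (1024,)
    up = up_w[half:] @ x                   # linear branch, shape (1024,)
    return down_w @ (gate * up)            # back to d_model, shape (512,)

out = swiglu_mlp(rng.standard_normal(d_model))
assert out.shape == (d_model,)
```

The input and output widths are unchanged from the old path; only the hidden width shrinks from 1536 to 1024 to pay for the second projection.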

Parameter Neutrality

| Bank | Old shape | New shape |
| --- | --- | --- |
| mlp_up_bank[i] | (1536, 512) | (2048, 512), gate‖up concatenated |
| mlp_down_bank[i] | (512, 1536) | (512, 1024) |

Proof: 2 × 512 × 1536 = 3 × 512 × 1024 = 1,572,864 per layer.
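The proof is plain arithmetic and can be checked directly (no framework needed):

```python
# Old stack: up (1536, 512) + down (512, 1536)
old_params = 1536 * 512 + 512 * 1536
# SwiGLU: fused gate||up (2048, 512) + narrower down (512, 1024)
new_params = 2048 * 512 + 512 * 1024

assert old_params == new_params == 1_572_864
```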

Status

Training logs pending (RunPod 8×H100). Will update with 3-seed results and final val_bpb.

New env vars

  • USE_SWIGLU=1 (default on)
  • SWIGLU_HALF_DIM=1024
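A sketch of how these flags might be read at startup (the variable names are from this PR; the parsing code itself is illustrative, not the repo's actual config logic):

```python
import os

# USE_SWIGLU defaults to on, per the PR description.
use_swiglu = os.environ.get("USE_SWIGLU", "1") == "1"
# Width of each half of the fused gate||up projection.
swiglu_half_dim = int(os.environ.get("SWIGLU_HALF_DIM", "1024"))
```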

Replace LeakyReLU(0.5)^2 with SwiGLU (silu gate * up projection).
Same parameter count: 3*512*1024 = 2*512*1536 = 1,572,864 per layer.
All other SOTA settings preserved (TTT, Parallel Muon, int6+lzma, etc.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>